Ana Luisa Pinheiro 11810407
Ayrton Amaral 11288131
Bruno Groper Morbin 11809875
Caio Febronio 11811482

Instituto de Matemática e Estatística - Universidade de São Paulo | July 2023


Study description

The connection between clustering and entropy comes from the idea that a successful clustering reduces the uncertainty, or randomness, within each cluster. A well-defined cluster contains data points that are similar to each other and dissimilar to points in other clusters. This reduction in uncertainty can be seen as a form of data compression.

When a dataset has well-separated clusters, each cluster can be encoded with a shorter representation (e.g., a cluster centroid or a label) instead of encoding every data point individually. This compression reduces the amount of information required to represent the dataset as a whole. Consequently, once the cluster label is known, the remaining entropy of the data (the conditional entropy given the cluster) is lower than the entropy of the original, unclustered dataset.

In other words, clustering can be seen as a form of data compression that aims to maximize the compression ratio by grouping similar data points together, thus reducing the entropy of the data.

1 Problem

The entropy formula itself is not directly used to separate data into clusters. Instead, it is used to quantify the uncertainty or randomness within a cluster or a distribution. To separate data into clusters, you would typically use clustering algorithms or techniques such as K-means, hierarchical clustering, or DBSCAN, among others.

However, once you have obtained the clusters using a clustering algorithm, you can calculate the entropy of each cluster to assess the degree of uncertainty or randomness within that cluster. Here’s a general approach:

  1. Perform Clustering: Apply a clustering algorithm of your choice to partition the data into clusters. Each data point will be assigned to a specific cluster based on some similarity or distance metric.

  2. Calculate Cluster Entropy: Once you have the clusters, you can calculate the entropy for each cluster individually. To do this, follow these steps:

    1. For each cluster, compute the probability distribution of the classes or values within that cluster. This distribution represents the relative frequencies of different classes or values within the cluster.
    2. Use the entropy formula (H(X) = - Σ P(x) * log2(P(x))) to calculate the entropy of each cluster based on its probability distribution.
    3. The entropy value will indicate the level of uncertainty or randomness within each cluster. Lower entropy values indicate more homogeneous clusters, where data points are similar to each other and have a predictable distribution, while higher entropy values indicate more diverse or mixed clusters.
  3. Analyze Results: Examine the entropy values of the clusters to gain insights into the clustering quality. Lower entropy clusters are generally considered more well-defined, while higher entropy clusters may indicate more ambiguity or overlap between data points.

Keep in mind that clustering and entropy calculation are separate steps in the analysis. Clustering determines the assignment of data points to clusters, while entropy provides a measure of uncertainty or randomness within each cluster.
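The per-cluster entropy calculation described above can be sketched in base R as follows. This is a minimal illustration, not the analysis used in this report: the helper `shannon_entropy` and the toy vectors `clusters` and `classes` are hypothetical names introduced only for this example.

```r
# Shannon entropy (in bits) of a vector of discrete labels
shannon_entropy <- function(labels) {
  p <- table(labels) / length(labels)  # relative frequency of each class
  p <- p[p > 0]                        # drop empty classes so 0 * log2(0) never appears
  -sum(p * log2(p))
}

# Hypothetical toy data: cluster assignment and class label of 8 points
clusters <- c(1, 1, 1, 1, 2, 2, 2, 2)
classes  <- c("a", "a", "a", "a", "a", "a", "b", "b")

# Entropy of the class distribution inside each cluster
cluster_entropy <- tapply(classes, clusters, shannon_entropy)
print(cluster_entropy)
```

In this toy case the first cluster contains a single class and the second an even mix of two classes, so the first is the more homogeneous of the two.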

2 Proposal

Information gain is another concept from information theory that can be used to evaluate the quality of clustering or the effectiveness of features in separating data into clusters. It measures the reduction in entropy achieved by partitioning the data based on a particular feature or attribute.

Here’s how information gain can be used in the context of clustering:

  1. Calculate the Initial Entropy: Calculate the entropy of the target variable or the distribution of classes in the entire dataset before clustering. This will serve as the baseline entropy.

  2. Perform Clustering: Apply a clustering algorithm to partition the data into clusters based on a set of features or attributes. Each data point will be assigned to a specific cluster.

  3. Calculate Cluster Entropy: For each cluster obtained from the clustering algorithm, calculate the entropy of the target variable within that cluster. This will involve computing the probability distribution of classes within each cluster and then calculating the entropy using the formula: H(X) = - Σ P(x) * log2(P(x)).

  4. Calculate Information Gain: Information gain is calculated by comparing the initial entropy (step 1) with the entropy of each cluster (step 3). The information gain achieved by partitioning the data based on a particular feature or attribute is given by:

    Information Gain = Initial Entropy - Σ (Proportion of data in each cluster * Cluster Entropy)

    The proportion of data in each cluster can be determined by dividing the number of data points in the cluster by the total number of data points.

  5. Evaluate Information Gain: Higher information gain indicates that the feature or attribute used for clustering has effectively reduced the uncertainty or randomness within the clusters. It suggests that the chosen feature provides valuable information for separating the data into distinct clusters.

  6. Iterative Feature Selection: You can repeat steps 2 to 5 with different features or attributes to compare their information gain values. This can help in identifying the most informative features for clustering or in prioritizing the order of feature selection.

Information gain is particularly useful in decision tree-based clustering algorithms, where features are recursively selected to optimize the separation of data into clusters. It helps in identifying the most discriminative features that contribute the most to cluster formation.
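Steps 1 to 4 above can be sketched in base R as follows. This is an illustrative sketch under simple assumptions (discrete class labels, one cluster label per point); `shannon_entropy`, `information_gain` and the toy vectors are hypothetical names that do not come from the analysis below.

```r
# Shannon entropy (in bits) of a vector of discrete labels
shannon_entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain = initial entropy - weighted average of the cluster entropies
information_gain <- function(classes, clusters) {
  h_initial <- shannon_entropy(classes)                    # step 1: baseline entropy
  h_cluster <- tapply(classes, clusters, shannon_entropy)  # step 3: entropy per cluster
  weights   <- table(clusters) / length(clusters)          # proportion of data in each cluster
  h_initial - sum(as.numeric(weights) * as.numeric(h_cluster))  # step 4
}

# Hypothetical toy data: two balanced classes
classes <- c("a", "a", "a", "a", "b", "b", "b", "b")
good    <- c(1, 1, 1, 1, 2, 2, 2, 2)  # clusters aligned with the classes
bad     <- c(1, 2, 1, 2, 1, 2, 1, 2)  # clusters unrelated to the classes

ig_good <- information_gain(classes, good)
ig_bad  <- information_gain(classes, bad)
print(c(good = ig_good, bad = ig_bad))
```

A partition aligned with the classes recovers the full baseline entropy as gain, while a partition unrelated to the classes yields no gain.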

2.1 Outline

2.2 Code

# Loading packages
library(tidyverse)
library(dplyr)
library(cluster)
library(infotheo)
# library(reticulate) 

2.2.1 Dataset

load("glioma.RData") # geneInfo ; gliomaGSE52009 ; targetInfoGlioma
glioma <- gliomaGSE52009[1:500,]; as.data.frame(glioma)
info <- targetInfoGlioma; rownames(info) <- NULL; info |> select(colnames(info[,-1]),FileName)
str(info[,-1]) # ignoring the FileName column
'data.frame':   120 obs. of  7 variables:
 $ gender      : chr  "female" "female" "female" "unknown" ...
 $ age         : num  37 61 37 -100 31 42 42 38 32 40 ...
 $ geoAccession: chr  "GSM1257398" "GSM1257399" "GSM1257400" "GSM1257401" ...
 $ sampleInfo  : chr  "Astrocytoma.a-0253" "Astrocytoma.a-0258" "Astrocytoma.a-0281" "Astrocytoma.a-0285" ...
 $ diagnostic  : chr  "astrocytoma" "astrocytoma" "astrocytoma" "astrocytoma" ...
 $ datasetId   : chr  "Glioma52009" "Glioma52009" "Glioma52009" "Glioma52009" ...
 $ tissue      : chr  "brain" "brain" "brain" "brain" ...
info$gender <- factor(info$gender); levels(info$gender)
 "female"  "male"    "unknown"
ggplot(info, aes(x=reorder(gender, -table(gender)[gender])))+
  geom_bar(aes(fill=gender), color="transparent")+
  scale_fill_manual(values=c(male="#3E67A3",female="#A34336",unknown="#7F8F85"))+
  geom_text(stat = 'count', aes(label = paste0(round((after_stat(count)/sum(after_stat(count)))*100), "%")), vjust = -0.5) +
  scale_y_continuous(expand = expansion(mult=c(0,.2)))+
  labs(x=NULL,y=NULL, title="Gender")+
  guides(fill="none")

info$diagnostic <- factor(info$diagnostic); levels(info$diagnostic)
 "anaplastic.astrocytoma"       "anaplastic.oligodendrocytoma"
 "anaplastic.oligodendroglioma" "astrocytoma"                 
 "glioblastoma"                 "oligodendroglioma"           
ggplot(info, aes(x=reorder(diagnostic, -table(diagnostic)[diagnostic])))+
  geom_bar(aes(fill=diagnostic), color="transparent")+
  scale_fill_grey(end = 0.9, start=.5)+
  geom_text(stat = 'count', aes(label = paste0(round((after_stat(count)/sum(after_stat(count)))*100), "%")), vjust = -0.5) +
  scale_y_continuous(expand = expansion(mult=c(0,.15)))+
  labs(x=NULL,y=NULL, title="Diagnosis")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1, vjust=1))+
  guides(fill="none")

range(info$age)
 -100   70
cat(paste0(sum(info$age<=0)," invalid age entries  => ",round(sum(info$age<=0)/nrow(info)*100,2), "% of the sample")) # percentage of invalid age entries
17 invalid age entries  => 14.17% of the sample
range(info[which(info$age>0),]$age)
 17 70
ggplot(info[which(info$age>0),], aes(x = age, y = after_stat(density))) +
  geom_histogram(fill = "skyblue", color = "#0c0c0c", binwidth = 5, alpha = 0.9) +
  geom_density(color = "cyan", linetype = "solid", linewidth = 1, fill="transparent") +
  labs(x = "Age", y = "Density", subtitle ="(valid entries only)", title="Age distribution in the sample")+
  scale_y_continuous(expand = expansion(mult=c(0,.20)))+
  scale_x_continuous(n.breaks = 20)

unique(info$datasetId)
 "Glioma52009"
unique(info$tissue)
 "brain"

2.2.2 Gene clustering

Hierarchical clustering is applied to group genes that show similar luminosity intensity across individuals. Because all measurements were collected on the same scale, the data are not standardized. If they were standardized, genes with weak intensities could end up grouped with genes whose original intensities are much stronger.

d<-dist(glioma,method = "euclidean") # Euclidean distance between every pair of genes

Average linkage: the average of the distances between every point in one cluster and every point in the other cluster.

Centroid linkage: the distance between the center point (centroid) of one cluster and the center point of the other cluster.

Ward’s linkage: at each step, the two clusters merged are those whose union gives the smallest increase in total within-cluster variance. The within-cluster variance is computed from the distances between the observations and the center point of their cluster.
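The three linkage criteria can be compared on a small example with `hclust()` from base R. The sketch below uses hypothetical toy data (`toy`, two well-separated groups of five points) rather than the glioma matrix, and it uses `"ward.D2"`, the option that the `hclust()` documentation describes as implementing Ward's criterion on unsquared distances. The same documentation notes that centroid linkage is meant to be used with squared Euclidean distances, hence `d_toy^2`.

```r
set.seed(1)

# Hypothetical toy data: two well-separated groups of 5 points in 2 dimensions
toy <- rbind(matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(10, mean = 5, sd = 0.3), ncol = 2))
d_toy <- dist(toy, method = "euclidean")

# One dendrogram per linkage criterion
fits <- list(
  average  = hclust(d_toy,   method = "average"),
  centroid = hclust(d_toy^2, method = "centroid"),
  ward     = hclust(d_toy,   method = "ward.D2")
)

# Cut every dendrogram into 2 clusters; one column of labels per method
labels2 <- sapply(fits, cutree, k = 2)
print(labels2)
```

With groups this far apart the three criteria should agree; on real expression data they can produce noticeably different partitions, which is why the choice of linkage matters.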

hc <- hclust(d, method = "ward.D") # hierarchical clustering with Ward.D linkage
# Note: on unsquared Euclidean distances, method = "ward.D2" is the option that implements Ward's criterion; "ward.D" is kept because the results below were produced with it
num_cluster_init <- 15
init_clusters <- cutree(hc, num_cluster_init) # initial clusters
# Example of one of the initially selected clusters
as.data.frame(glioma[which(init_clusters == 1), ])
genes.clusters <- list()
for (clust in 1:num_cluster_init) {
  gp <- glioma[which(init_clusters == clust), , drop = FALSE] # select the cluster (drop = FALSE keeps a single-gene cluster two-dimensional)

  # 15th and 85th percentiles of the intensities of each individual within this set of genes
  bounds <- apply(gp, MARGIN = 2, FUN = function(x) quantile(x, c(0.15, 0.85)))

  # flag whether each intensity in the cluster lies within that individual's percentile range
  in_bound <- as.data.frame(lapply(1:ncol(gp), FUN = function(i) as.numeric(between(gp[, i], bounds[1, i], bounds[2, i]))))
  colnames(in_bound) <- colnames(bounds)
  rownames(in_bound) <- rownames(gp)

  # keep in the cluster only the genes that fall within the range for at least 70% of the individuals
  select_genes <- rowSums(in_bound) >= 0.7 * ncol(glioma)
  select_names <- names(which(select_genes))
  if (length(select_names) > 0) {
    genes.clusters <- append(genes.clusters, list(select_names))
  }
}

for(i in 1:length(genes.clusters)){
  if(i==1) cat("--- Genes:\n")
  cat(paste0("\nCluster ",i,":\n"))
  cat(genes.clusters[[i]],sep=" | ")
  cat("\n")
}
--- Genes:

Cluster 1:
10011 | 100130856 | 10048 | 10067 | 10073 | 10075 | 10096 | 10097 | 10101 | 10103 | 10190 | 10228 | 10237 | 10238 | 10294 | 10299 | 103 | 10329 | 10400 | 10401 | 10445 | 10469

Cluster 2:
10 | 10009 | 100128782 | 100128922 | 100129138 | 100130557 | 100131816 | 100132319 | 100192378 | 10069 | 100859930 | 10126 | 10127 | 10139 | 10144 | 10157 | 10162 | 10165 | 10178 | 1018 | 10196 | 10201 | 10241 | 10256 | 10298 | 10326 | 10352 | 10385 | 1039 | 104 | 1040 | 10451 | 10464 | 10465 | 10481 | 10491 | 10495

Cluster 3:
100130175 | 10020 | 10079 | 10081 | 10084 | 10147 | 10194 | 10195 | 10211 | 10270 | 10277 | 10291 | 10327 | 10423 | 10443 | 10473 | 10478 | 10483 | 10488 | 10493

Cluster 4:
10000 | 100128398 | 100129503 | 100129794 | 100130815 | 100131227 | 100131756 | 10017 | 10019 | 100192379 | 10022 | 1003 | 10039 | 1004 | 10044 | 10050 | 10076 | 1010 | 10107 | 10125 | 10198 | 10242 | 10246 | 10254 | 10272 | 10311 | 10316 | 10325 | 10390 | 10406 | 10426 | 10462 | 1048

Cluster 5:
100009676 | 100127983 | 100128816 | 100128893 | 100130071 | 100130264 | 100131131 | 100131454 | 100132356 | 100141515 | 10018 | 10047 | 100506742 | 10053 | 10060 | 10077 | 10117 | 1014 | 10158 | 10205 | 10223 | 10249 | 10260 | 10324 | 10331 | 10333 | 10343 | 10344 | 10345 | 1038 | 10389 | 10499

Cluster 6:
10002 | 10003 | 10004 | 100124700 | 100128236 | 100128327 | 100128979 | 100129033 | 100129066 | 100129637 | 100129935 | 100131439 | 100131510 | 100169752 | 100190949 | 100288254 | 10045 | 10114 | 10143 | 1015 | 10207 | 10218 | 10257 | 10317

Cluster 7:
10025 | 10042 | 10061 | 10128 | 10200 | 10365 | 10434 | 10454 | 10466

Cluster 8:
10010 | 10057 | 10138 | 10208 | 10210 | 10217 | 10229 | 10239 | 10243 | 10363 | 10393 | 10411 | 10444 | 10463 | 10479 | 10484 | 10497

Cluster 9:
100127886 | 100128124 | 100128843 | 100129434 | 100129455 | 100129726 | 100134713

Cluster 10:
100128731 | 10062 | 10130 | 10159 | 10236 | 10240 | 10284 | 10289 | 10296 | 10424 | 10428 | 10474 | 10477 | 10490 | 10494

Cluster 11:
100128775 | 10109 | 10169 | 10248 | 10330 | 103910 | 10399 | 10476

Cluster 12:
10013 | 10038 | 10046 | 10056 | 10121 | 10131 | 10174 | 10175 | 10199 | 10206 | 1022 | 10244 | 10283 | 10285 | 10286 | 10313 | 10342 | 10432 | 10436 | 10492

Cluster 13:
10092 | 10102 | 10213 | 10269 | 10328 | 10425 | 10430 | 10440 | 10456 | 10480

Cluster 14:
1012 | 1016

2.2.3 Unified distribution per cluster

Now that the genes have been grouped by similarity, the mutual information between and within the groups of selected genes is examined. To do so, the entries must first be discretized in order to obtain the empirical distribution of each gene.

https://search.r-project.org/CRAN/refmans/infotheo/html/multiinformation.html

breaks <- seq(floor(min(glioma)), ceiling(max(glioma)), by=1) # unit-width intervals used to discretize the intensities

mutualinfo_clust <- list()
for(clust in 1:length(genes.clusters)){
  cluster_disc_temp <- apply(glioma[genes.clusters[[clust]], , drop = FALSE], MARGIN = c(1,2),function(x) cut(x, breaks)) # discretize the cluster
  
  mutualinfo_clust <- append(mutualinfo_clust,multiinformation(t(cluster_disc_temp))) # total correlation (in nats, since the natural logarithm is used for the entropy)
}

for(i in 1:length(genes.clusters)){
  if(i==1) cat("--- Mutual information (in nats):\n")
  cat(paste0("\nCluster ",i,":\t ",format(round(mutualinfo_clust[[i]],2),width = 6, nsmall = 2)))
}
--- Mutual information (in nats):

Cluster 1:    15.22
Cluster 2:    31.26
Cluster 3:    13.12
Cluster 4:    15.56
Cluster 5:    12.73
Cluster 6:     1.09
Cluster 7:     3.47
Cluster 8:    11.88
Cluster 9:     2.70
Cluster 10:    9.62
Cluster 11:    2.22
Cluster 12:   13.51
Cluster 13:    3.88
Cluster 14:    0.56
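The quantity reported per cluster is the total correlation: the sum of the marginal entropies of the discretized genes minus their joint entropy. The base-R sketch below shows that definition on hypothetical toy data. The helpers `entropy_nats` and `total_correlation` are illustrative names, not part of `infotheo`, and the assumption that they match `multiinformation()` with its default empirical estimator follows from the linked documentation, not from a comparison run here.

```r
# Empirical Shannon entropy (in nats) of a discrete vector
entropy_nats <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Total correlation: sum of marginal entropies minus the joint entropy.
# `disc` holds one discretized variable per column and one observation per row.
total_correlation <- function(disc) {
  cols <- unname(as.list(as.data.frame(disc)))
  marginal <- sum(sapply(cols, entropy_nats))
  # Each row is pasted into a single string, so the joint entropy is taken over whole rows
  joint <- entropy_nats(do.call(paste, c(cols, sep = "\r")))
  marginal - joint
}

# Hypothetical toy data: two identical variables versus two independent ones
identical_vars <- data.frame(a = c(0, 0, 1, 1), b = c(0, 0, 1, 1))
independent    <- data.frame(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1))

tc_identical   <- total_correlation(identical_vars)
tc_independent <- total_correlation(independent)
print(c(identical = tc_identical, independent = tc_independent))
```

Identical variables share all of their information, whereas independent variables share none, which gives two reference points for reading the per-cluster values.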

2.2.4 Feature discrimination

3 Conclusion